by Peter de Blanc + ChatGPT Deep Research
Posted to Adarie (www.adarie.com) on April 3, 2025
Content License: Creative Commons CC0 (No Rights Reserved)
Evaluating large language models (LLMs) on novel tasks (like game-playing) requires careful planning. This tutorial will guide you through designing a good evaluation ("eval"), preparing data, writing and running the eval, and sharing your results. We assume you have a GitHub account, basic programming knowledge, and familiarity with LLMs. By the end, you should be able to build a custom eval (e.g. testing an LLM's Go-playing skill using KataGo data) and publish it for others.
What Makes a Good Eval? A well-designed eval should provide meaningful insights about model performance. Several design decisions determine whether it does:
Single vs. Multiple Evals (Scope): Decide if you need one comprehensive eval or separate evals for different skills. A broad eval (covering many sub-tasks) can give an overall picture, but might mix multiple metrics or be harder to interpret. Conversely, multiple smaller evals let you isolate specific capabilities (for example, one eval for strategy in game play, another for mathematical reasoning). If tasks are very different or require different metrics, it’s often better to create separate evals for clarity. On the other hand, if tasks are variations of a theme, grouping them into one eval can showcase combined performance. Consider maintenance too: updating a monolithic eval vs. several targeted ones.
Task Diversity vs. Specificity: There’s a trade-off in eval design between being broad or specific. A diverse eval (covering many contexts or sub-tasks) ensures the model isn’t overfitting to a narrow pattern and can reveal general robustness. However, very diverse evals might dilute focus – a model might do well on some parts and poorly on others, making it hard to pinpoint issues. A specific eval zooms in on one capability or scenario, providing detailed insight there, but might miss other weaknesses. In practice, you should align the scope with your goals: for a general model benchmark, diversity is key, while for testing a specific feature (like Go move prediction), specificity yields more actionable feedback. You can also do both: start with specific evals to diagnose particular skills, then combine them into a diverse suite for an overall assessment.
Clear Success Criteria: Before building anything, define what success looks like. Is it 90% accuracy on a quiz? Winning 50% of games against an engine? Formulate a clear goal so that the eval’s results are meaningful.
How Much Data Do You Need? The number of examples should be enough to yield statistically meaningful results, but not so large as to be unwieldy or expensive. The required size depends on the variability of the task and the differences you expect to measure. For quick iteration or demos, even a few dozen examples might suffice. For robust benchmarks, you might need hundreds or thousands. For instance, one community chess puzzle eval used 1000 puzzles to benchmark LLMs, which gave a solid basis for comparing models. If your eval will be used to detect small performance regressions, lean towards a larger sample size to reduce noise. Remember that API costs can accumulate with large evals (OpenAI’s eval framework even warns to be mindful of API usage costs), so strike a balance between thoroughness and cost.
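As a rough sanity check on sample size, you can estimate the margin of error of an accuracy measurement with the standard binomial approximation. The helper below is a generic statistics sketch (not part of any eval framework); the expected-accuracy value is just an assumption you plug in:

import math

def accuracy_margin_of_error(n_examples, expected_accuracy=0.7, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_examples."""
    p = expected_accuracy
    return z * math.sqrt(p * (1 - p) / n_examples)

# With 100 examples and ~70% accuracy, the margin is roughly +/-9 points;
# with 1000 examples it shrinks to roughly +/-2.8 points.
for n in (50, 100, 500, 1000):
    print(n, round(accuracy_margin_of_error(n), 3))

This is why a few dozen examples are fine for a quick demo, while evals meant to detect small regressions benefit from hundreds or thousands.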
Data Diversity and Quality: Ensure your dataset covers a range of scenarios relevant to your task. For example, if evaluating Go moves, include positions from different openings, middle-game and endgame situations, both typical and edge-case scenarios. Diversity helps test the model’s consistency. However, all examples should be quality-checked: errors in the dataset (wrong answers, ambiguous questions) will make your eval unreliable. If possible, have human experts review data or generate data with an LLM and then filter it using a stronger model or human feedback (an approach OpenAI and others have used to improve eval quality). It’s often better to have fewer high-quality examples than many noisy ones.
Data Licensing and Ethics: Always use data you have rights to. If you collect or create the examples yourself, you can choose how to license them. Common choices for open data are Creative Commons licenses like CC BY (requires attribution) or CC0 (public domain). If you’re repurposing existing data (e.g. game records, transcripts), check the source’s license or terms of use. KataGo data, for example, is open-source (the neural network weights are MIT licensed; see "katagotraining.org - Neural Network License"), and many Go game records are public domain or CC-licensed. Still, always give credit to sources and ensure you’re not violating any terms. If your eval involves sensitive content (like user data or proprietary info), you might opt to keep that data private and only share aggregate results or a sanitized version.
Preparing Data Format: Once you have raw data, convert it into a convenient format for evaluation. Most eval frameworks use JSON Lines (JSONL) or CSV for datasets. JSONL is simply a text file where each line is a JSON object representing one example. For instance, an example could be:
{"problem": "2+2=", "answer": "4"}
Each line should contain all information needed for that test case: the prompt/input and the expected output or evaluation info. We’ll discuss prompt formatting next, but think ahead about what fields you need (e.g., for a question-answer task, you might have {"question": "...", "answer": "..."} for each example).
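Since JSONL is just one JSON object per line, preparing and loading a dataset takes only a few lines of Python. A minimal sketch (the field names here are illustrative):

import json

examples = [
    {"question": "2+2=", "answer": "4"},
    {"question": "7*6=", "answer": "42"},
]

# Write one JSON object per line
with open("my_eval.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read it back
with open("my_eval.jsonl") as f:
    samples = [json.loads(line) for line in f]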
Train vs Test in Evals: Some evals (especially those using few-shot prompting) include “train” examples in the prompt and then test the model on a “test” query. In such cases, you might prepare separate sets. For instance, the OpenAI Evals tutorial creates a tiny dataset with 2 training examples and 2 test examples to include the training examples as few-shot context (see evals/docs/custom-eval.md in the openai/evals repository). Whether you separate train/test depends on your eval design. If each test case is independent (no in-prompt examples), you can just have one set of samples. If you plan to include some examples as part of the prompt (few-shot), you may need a way to distinguish or pair them with test prompts.
Designing the input prompt for each example is crucial, as it directly affects how the model responds.
Prompt Formatting: Choose a format that your target models handle well. For modern chat-based LLMs (OpenAI GPT-4, Anthropic Claude, Google Gemini, etc.), using a chat format with roles may yield the best results. In OpenAI’s eval framework, they encourage using the new chat format for prompts. For example, a chat prompt might be represented in JSON as:
{
  "messages": [
    {"role": "system", "content": "You are a Go expert."},
    {"role": "user", "content": "Here's a Go board state: <state>. What is Black's best move?"}
  ],
  "ideal": "The best move is D4."
}
Here, ideal (or a similar field) would contain the expected/ideal answer. If using older models or simpler completion-style APIs, you might instead craft a single text string prompt (concatenating any context and question). OpenAI’s tooling can convert chat format to the older completion format if needed.
Few-Shot Examples: For tasks where the model benefits from seeing a few demonstrations, you can include few-shot examples in the prompt. This means prepending a few (input, output) pairs before asking the real question. For instance, if evaluating arithmetic, your prompt to the model might literally contain: "Q: 2+2=? A: 4\nQ: 4*4=? A: 16\nQ: 3+5=? A:" and you expect the model to continue with the answer. These demonstrations can improve performance on the eval if the model uses in-context learning. Make sure the few-shot examples are representative and don't give away answers to the test queries. Include variations in format and content if possible, so the model learns the pattern but not a trivial mapping.
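A small helper keeps the few-shot format consistent across examples. This is an illustrative sketch (the helper name and Q/A format are not from any framework):

def build_few_shot_prompt(demos, test_question):
    """demos: list of (question, answer) pairs; test_question: the query to evaluate."""
    lines = [f"Q: {q} A: {a}" for q, a in demos]
    lines.append(f"Q: {test_question} A:")
    return "\n".join(lines)

demos = [("2+2=?", "4"), ("4*4=?", "16")]
print(build_few_shot_prompt(demos, "3+5=?"))
# Q: 2+2=? A: 4
# Q: 4*4=? A: 16
# Q: 3+5=? A: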
Structured Inputs (Games, Code, etc.): Sometimes inputs aren’t plain English. For example, to test Go or chess, you need to describe a board state. You have a few options: you can list the coordinates of each stone or piece (as we do in the Go example later), or you can draw the board as a text grid (using . for empty, X for black, O for white). The grid is verbose but human-readable and may be within the model’s understanding (see the sketch below).
In summary, format the prompt in a way that the model can parse. Provide any needed instructions as part of the prompt (e.g., “Output only the move, in coordinates.” to prevent the model from chatting). Consistency in formatting across examples is important so that a single eval logic can handle all cases.
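As an illustration of the text-grid option, here is a small formatter that turns lists of stone coordinates into a printable board. It is a sketch under simple assumptions (a 9x9 board, letter-number coordinates that skip the letter I, and no handling of captures):

BOARD_SIZE = 9
COLS = "ABCDEFGHJ"  # Go coordinates traditionally skip the letter I

def board_to_grid(black_stones, white_stones):
    """black_stones/white_stones: coordinate strings like "D4"."""
    grid = [["." for _ in range(BOARD_SIZE)] for _ in range(BOARD_SIZE)]
    for coord, mark in [(c, "X") for c in black_stones] + [(c, "O") for c in white_stones]:
        col = COLS.index(coord[0].upper())
        row = BOARD_SIZE - int(coord[1:])  # row 1 is drawn at the bottom
        grid[row][col] = mark
    return "\n".join(" ".join(row) for row in grid)

print(board_to_grid(["D4", "E5"], ["C3"]))

Whichever representation you choose, use it identically in every example so one prompt-construction function covers the whole dataset.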
Avoiding Prompt Leaks and Clues: Ensure your prompt for each test case doesn't accidentally reveal the expected answer. For example, if your data is in question-answer format, don’t include the answer in the prompt (unless it’s part of a few-shot example clearly separated from the test query). This seems obvious, but subtle leaks (like including the answer as a “hint”) can happen if you’re not careful.
After the model produces an output for each prompt, your eval needs to score it. Designing the right metric is as important as the prompt.
Exact Match vs. Partial Credit: If the task has a single correct answer (e.g., a math problem or a specific fact), you can use exact match scoring: the model gets 1 point for a correct answer and 0 for anything else. This yields an accuracy score over all examples. Exact match is simple and deterministic, but it can be too strict for open-ended tasks or those with multiple correct answers. In tasks like language translation or summarization, there isn't a single "correct" output. For those, use metrics that allow partial credit or similarity, such as the approaches described next:
LLM-as-a-Judge: A modern approach is to use a strong LLM to grade the outputs of another (or same) model. For example, you might prompt GPT-4 with the model’s answer and the reference, asking for a score or verdict. Research has found that these model-based judgments can correlate well with human judgment. If you do this, design a clear rubric for the judging LLM (e.g., “Score 1 if the answer is correct, 0 if not. Here is the question, answer, and solution...”). Note that this introduces an extra dependency (the judge model might have its own biases or errors). It’s often good to spot-check some scoring outputs manually or include a few cases with known outcomes to verify the judge’s reliability.
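A minimal judging call might look like the sketch below. It assumes the same pre-1.0 openai Python SDK used elsewhere in this tutorial, and the rubric wording is purely illustrative:

import openai

JUDGE_RUBRIC = (
    "You are grading an answer. Score 1 if the answer is correct given the reference, "
    "otherwise score 0. Reply with only the digit."
)

def judge_answer(question, model_answer, reference, judge_model="gpt-4"):
    # Ask the judge model to apply the rubric to one (question, answer, reference) triple
    resp = openai.ChatCompletion.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Question: {question}\nReference: {reference}\nAnswer: {model_answer}"},
        ],
    )
    text = resp["choices"][0]["message"]["content"].strip()
    return 1 if text.startswith("1") else 0

As noted above, spot-check a handful of judged cases by hand before trusting the scores.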
External Tools for Scoring: In specialized domains, external tools or models can provide a ground truth or evaluation. For example, a strong Go engine like KataGo can rate a suggested move, which is exactly how we score the example eval later in this tutorial.
Deterministic vs. Probabilistic Outputs: Decide if your eval expects a deterministic output or if some randomness is acceptable. Ideally, for evals, make the model's outputs deterministic by controlling the generation settings. For instance, use temperature=0 in the OpenAI API to minimize randomness so that each run of the eval is repeatable. If your eval absolutely requires sampling (say, evaluating the model’s creativity or diversity), then you might need to run multiple trials. In such cases, scoring could involve statistics (e.g., out of 5 sampled stories, 3 met the criteria). But this complicates things. As a rule of thumb, for straightforward evals, keep generation deterministic to get consistent scores run-to-run.
Reference Outputs: If you have a known correct answer for each example, store it in the data (like an "ideal" or "expected" field). Your eval code can then compare model output to this reference. If multiple answers are acceptable, you can store a list of acceptable answers or a pattern to match. For example, if any synonym of a word is fine, your scoring function could check if the model’s answer is in a set of synonyms. For numeric or structured outputs, you might compute a numeric error (e.g., difference between expected and model output if both are numbers). The key is implementing the scoring logic that aligns with what you consider "correct enough."
Custom Score Functions: In some eval frameworks (like OpenAI’s), you can write custom metrics in code. For example, you could parse the model’s output and give points for certain content. If doing this, thoroughly test your scoring function on sample outputs to ensure it behaves as expected (no false positives or negatives).
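For example, a custom scoring function along the lines just described might accept either a single reference, a list of acceptable answers, or a numeric tolerance. This is a sketch with illustrative field names (the "numeric" and "tolerance" fields are assumptions, not part of any framework):

def score_output(sample, output):
    """Return 1 for an acceptable answer, 0 otherwise (illustrative logic)."""
    answer = output.strip().lower()
    ideal = sample["ideal"]
    acceptable = ideal if isinstance(ideal, list) else [ideal]

    # Numeric comparison with tolerance, if the sample asks for it
    if sample.get("numeric"):
        try:
            return 1 if abs(float(answer) - float(acceptable[0])) <= sample.get("tolerance", 0) else 0
        except ValueError:
            return 0

    # Otherwise: exact match against any acceptable answer (case-insensitive)
    return 1 if answer in [a.strip().lower() for a in acceptable] else 0

# Quick checks on hand-written cases, as suggested above
assert score_output({"ideal": "4"}, " 4 ") == 1
assert score_output({"ideal": ["car", "automobile"]}, "Automobile") == 1
assert score_output({"ideal": "3.14", "numeric": True, "tolerance": 0.01}, "3.141") == 1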
To summarize, design metrics that truly reflect success on your task. Keep them as simple as possible, but not simpler – they should capture critical aspects of performance. When in doubt, correlate automatic metrics with human judgment by reviewing a subset of outputs.
Now to the hands-on part: implementing your eval. You have two broad approaches: use an existing eval framework (such as OpenAI Evals or the LM Evaluation Harness), or write your own script from scratch.
We’ll outline a Python approach, as it's common, but the logic applies in any language.
5.1 Choosing a Framework or DIY: If you use OpenAI's Evals framework, much is handled for you (data loading, logging, etc.). You'd write a small amount of code or YAML configuration to define your eval. For example, an OpenAI eval can be configured in a YAML like:
my_eval:
  id: my_eval.dev.v0
  description: Math addition questions
  metrics: [accuracy]
my_eval.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: my_eval/samples.jsonl
    # any additional args accepted by the eval class
Here class refers to a built-in eval class that checks whether the model output matches the reference (the Match eval in this case), and samples_jsonl points to your data file. This YAML essentially registers the eval. You would then run it with the framework’s CLI (for example, oaieval gpt-3.5-turbo my_eval).
If you want more customization (like complex scoring or multi-turn interaction), you might write a Python class inheriting from the eval framework’s base classes. OpenAI’s guide suggests focusing on existing templates if possible, resorting to code only for novel logic. Other frameworks like LM Evaluation Harness allow adding new tasks via Python classes or configs as well.
5.2 Writing a Simple Evaluation Script (Python): If not using a specialized framework, you can directly use model APIs. Here's a simplified pseudocode of how you might implement an eval in Python using OpenAI API (this can be adapted to Anthropic’s API, etc. by changing the client library):
import openai
import json

openai.api_key = "YOUR_API_KEY"  # ensure this is set securely

# Load your eval samples
samples = [json.loads(line) for line in open("my_eval.jsonl")]

def evaluate_sample(sample):
    prompt = construct_prompt(sample)  # build the prompt text or messages from sample
    # Call the model (assuming chat format for example)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=prompt,
        temperature=0  # deterministic
    )
    output = response["choices"][0]["message"]["content"]
    score = score_output(sample, output)  # e.g., 1 if matches sample["ideal"], else 0
    return score, output

results = []
for sample in samples:
    sc, out = evaluate_sample(sample)
    results.append({"sample": sample, "output": out, "score": sc})

# Compute aggregate metrics, e.g. accuracy
accuracy = sum(r["score"] for r in results) / len(results)
print(f"Accuracy: {accuracy:.2%}")
In the above, construct_prompt(sample) is a function you write to format the sample into the model input (it might use sample["question"] to form a message list or text), and score_output(sample, output) is your scoring logic, e.g., comparing the output to sample["ideal_answer"].
For Anthropic’s Claude, you’d use their client (anthropic.Client.completion()) with the prompt in their format (a single prompt string with human/assistant turns). For Google’s models, their API might differ. The principle remains: send prompt, get output, compare to reference.
TypeScript Implementation: You can do the same in Node.js using fetch or official SDKs. For example, using OpenAI’s Node SDK:
import fs from "fs";
import { Configuration, OpenAIApi } from "openai";

const config = new Configuration({ apiKey: process.env.OPENAI_API_KEY });
const openai = new OpenAIApi(config);

const samples = JSON.parse(fs.readFileSync("my_eval.json", "utf8")); // assuming it's an array in a JSON file

for (const sample of samples) {
  const prompt = constructPrompt(sample);
  const response = await openai.createChatCompletion({
    model: "gpt-4",
    messages: prompt,
    temperature: 0
  });
  const output = response.data.choices[0].message.content;
  const score = scoreOutput(sample, output);
  // ...collect results
}
TypeScript can also be useful if you want to integrate with a web interface or use browser-based eval (for example, if evaluating a model through a web UI or using a browser automation to test something).
API Access vs. Browser-Based Evaluation: In most cases, using direct API calls (as above) is appropriate. This is true for evaluating pure LLM behavior. However, if your eval involves the model interacting with a browser (say you’re evaluating a browsing agent plugin that looks up information), then a different approach is needed. You might have to automate a browser (using something like Playwright or Selenium) to simulate user interactions and capture model outputs in a web interface. That’s quite advanced and usually not necessary unless evaluating a full agent with tools. Another angle: some evaluation frameworks integrate with browser for collecting human feedback or for certain web-based tasks, but for our focus (LLM capabilities via API), sticking to API is simpler and more reproducible.
Ensuring Multi-Model Compatibility: To test across OpenAI, Anthropic, etc., you might abstract the model interface in your code. For example, have a function generate_response(model_name, prompt) that internally calls the appropriate API depending on model_name. Many community frameworks (like the EleutherAI harness) do this, supporting a variety of models with a unified interface. You can also use libraries like Hugging Face’s transformers to load local models or call API endpoints, but be mindful of rate limits and format differences. The goal is to avoid writing completely separate eval code for each model vendor. Instead, parameterize the model choice.
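A dispatcher along these lines might look like the sketch below. It assumes the SDK versions referenced in this tutorial (the pre-1.0 openai package and the early anthropic client with Client.completion()); adapt the calls to whatever client libraries you actually have installed:

import openai
import anthropic

# Assumes the early anthropic SDK whose Client exposes completion()
anthropic_client = anthropic.Client("YOUR_ANTHROPIC_KEY")

def generate_response(model_name, messages):
    """messages: chat-style list of {"role", "content"} dicts."""
    if model_name.startswith("gpt-"):
        resp = openai.ChatCompletion.create(model=model_name, messages=messages, temperature=0)
        return resp["choices"][0]["message"]["content"]
    elif model_name.startswith("claude"):
        # Convert chat messages into Claude's Human/Assistant prompt format
        prompt = ""
        for m in messages:
            role = anthropic.AI_PROMPT if m["role"] == "assistant" else anthropic.HUMAN_PROMPT
            prompt += f"{role} {m['content']}"
        prompt += anthropic.AI_PROMPT
        resp = anthropic_client.completion(
            model=model_name,
            prompt=prompt,
            max_tokens_to_sample=256,
            stop_sequences=[anthropic.HUMAN_PROMPT],
        )
        return resp["completion"]
    else:
        raise ValueError(f"Unknown model: {model_name}")

With this in place, the rest of the eval loop only ever calls generate_response, and switching vendors is a one-argument change.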
Sample Data and Code Structure: Organize your eval repository with clarity: for example, a data folder (such as data/my_eval/) containing your JSONL or JSON data, the eval script itself, and a README explaining how to run it. Keeping things well-structured will also help when publishing (others can understand and reproduce your eval).
Once implementation is ready, it’s time to run your eval and gather results.
Batching and Efficiency: If you have many examples, calling the API one-by-one can be slow and hit rate limits. Many APIs allow batching or streaming. For example, OpenAI’s API supports submitting prompts as a list for some endpoints, or you can use asynchronous calls. Some evaluation libraries automatically batch requests for speed. If writing from scratch, consider using Python’s asyncio or concurrent futures to send multiple requests in parallel (respect the API’s concurrency limits). Batch processing can dramatically speed up eval runs, especially for hundreds of examples.
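For example, a thread pool is a simple way to parallelize the evaluate_sample function from the script above. A sketch (the worker count is an arbitrary choice; keep it within your API's rate limits):

from concurrent.futures import ThreadPoolExecutor

def run_eval_in_parallel(samples, max_workers=5):
    """Evaluate samples concurrently; evaluate_sample is the function defined earlier."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with samples
        for sample, (score, output) in zip(samples, pool.map(evaluate_sample, samples)):
            results.append({"sample": sample, "output": output, "score": score})
    return results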
Local vs. Cloud Execution: Decide whether to run the eval on your local machine or in the cloud.
Debugging Tips: It’s common for something to go wrong on the first run. Start with a handful of examples, print the exact prompts and raw model outputs, and check your scoring logic on cases with known answers before scaling up to the full dataset.
Collecting Results: Decide on an output format for results. You might simply print metrics to screen. But it’s often useful to save detailed results (each example’s output and score) to a file (JSON or CSV). This allows analysis later, especially if you want to see which examples failed. OpenAI’s evals framework, for instance, can log to a JSONL or even a database, and also integrates with Weights & Biases for tracking runs. If you plan multiple eval runs (e.g., comparing models), systematically save them (naming files by model and date, for example).
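Saving per-example results is as simple as writing another JSONL file, named by model and date as suggested above. A sketch using the results list from the earlier script (the filename pattern is just a convention):

import json
from datetime import date

def save_results(results, model_name):
    filename = f"results_{model_name}_{date.today().isoformat()}.jsonl"
    with open(filename, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    return filename

save_results(results, "gpt-4")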
Interpreting Output: After a successful run, interpret the results carefully. Look beyond the top-line metric. For example, if accuracy is 70%, which 30% did it get wrong? Are they clustered in a certain category of input? Perhaps all failures are endgame positions in Go, indicating a weakness there. This analysis might lead you to refine your eval (maybe add more examples of that type or split out a sub-metric).
Iterate if Needed: It’s not uncommon to iterate on the eval design after seeing initial results. If your eval was too easy (all models score 100%), you might add harder cases. If it was too hard or ambiguous (all models score near 0, including ones you expected to do well), inspect if the questions are fair or if scoring is too harsh. The eval should ideally differentiate model performance in a useful range (not all 0s or all 100s).
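One way to do this kind of analysis is to tag each sample with a category (e.g., "opening", "endgame") and break accuracy down per category. The "category" field below is a hypothetical addition to the sample format, not something the earlier data includes:

from collections import defaultdict

def accuracy_by_category(results):
    """Group scores by a per-sample "category" field (hypothetical) and report accuracy."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["sample"].get("category", "uncategorized")].append(r["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}

for category, acc in accuracy_by_category(results).items():
    print(f"{category}: {acc:.0%}")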
You’ve built and run your custom eval – now consider sharing it so others can benefit or replicate it.
Open-Source the Code and Data: The best practice is to create a repository (on GitHub or similar) containing the eval code, the data files (.jsonl or whatever format) for the eval dataset, and a clear license. If the dataset is large, you could host it on a data hub (Hugging Face Datasets, Kaggle, etc.) and just provide a link or script to download it. State the license in a LICENSE file and/or in the README. This helps avoid any legal ambiguity.
Publishing on Eval Platforms: In addition to your own repo, you might contribute your eval to established platforms, such as OpenAI’s openai/evals repository. However, note the current guidance: they are not accepting evals with custom code at the moment, only certain YAML-based evals. This may change over time. If your eval fits their criteria (they look for evals that surface interesting capabilities or problems), you could submit a Pull Request to add it. This makes it visible to anyone using OpenAI’s framework.
Licensing and Permissions: Double-check any external data you included. If you incorporated data from somewhere (like game records, or KataGo’s outputs), mention the source and license in your repo. For instance, “Game positions taken from XYZ database, ©2023 ABC (used under CC BY 4.0 license).” If your eval might reveal sensitive information or you’re not sure about sharing certain pieces, consider anonymizing or aggregating. For example, if you had a proprietary dataset you can’t share, you might still share the scoring logic and perhaps a few example data points, so others could at least follow the method with their own data.
Privacy/Keeping Parts Private: If some parts must remain private (due to business or privacy reasons), you have options: share only the eval code and scoring logic while withholding the data, publish aggregate results rather than raw outputs, or release a sanitized subset of examples so others can still follow the method.
Community Engagement: Announce your eval on relevant forums or communities (the OpenAI community forum, Discord channels, Reddit, etc.). You might get feedback, or others might run their models on it and share results. This can be very insightful – maybe someone finds a model that does unexpectedly well or identifies a flaw in one of your examples.
Continuous Updates: If your eval becomes popular or if LLMs improve, you might update it. For instance, if models start solving all your Go positions easily, you’d want to add harder ones. Treat an eval as a living benchmark that can evolve. Just be sure to version it (so results from v1 vs v2 aren’t confused).
Finally, when publishing, articulate why the eval is useful. For example: “This eval tests strategic planning in Go. It’s challenging for GPT-4-class models and can help identify whether new models have improved in long-term planning.” Clear motivation will attract users (and possibly contributors) to your eval.
Let’s put it all together with an example eval: assessing an LLM’s skill at Go (the board game) by using KataGo (a strong Go engine) as a reference. Our goal will be to see if an LLM can suggest good moves for a given Go board state. Each test prompt describes the board as a list of stone coordinates, along these lines:
You are an expert Go player. Analyze the board and suggest the best move for the player to move.
Board state:
- Black: D4, Q16, ...
- White: C3, D16, ...
It is Black's turn.
What is Black's best move?
We might not include few-shot examples because describing one board and then another might confuse things. We’ll rely on the instruction and the board listing being clear.
For each position, we take KataGo’s top-rated move as the reference answer, say D4. So our ideal answer for that example is "D4". If multiple moves are nearly equal (within say 1% win rate), we could allow any of those as correct (store a list like ["D4","C16"] if applicable).
We create a file go_eval.jsonl with each line like:
{
  "board": "Black: D4, Q16, K10, ...; White: C3, D16, Q4, ...; turn: Black",
  "ideal": "J4"
}
This represents one test case (with a truncated list of stones for brevity). Ensure all coordinates are valid and the board state is plausible (no overlaps, etc.). We include 50 such lines, each with a unique board and ideal best move.
(If we had KataGo's full analysis, we could include win-rate info or secondary moves, but for simplicity we just put the top move.)
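If you do generate the data with KataGo’s JSON analysis engine, a small converter can turn its responses into eval samples. The sketch below assumes the analysis engine’s response format (a "moveInfos" list with "move" and "winrate" fields); double-check the field names against the KataGo documentation for your version:

def katago_response_to_sample(black_stones, white_stones, turn, analysis, winrate_margin=0.01):
    """analysis: one parsed JSON response from KataGo's analysis engine (assumed format)."""
    move_infos = sorted(analysis["moveInfos"], key=lambda m: m["winrate"], reverse=True)
    best = move_infos[0]["winrate"]
    # Accept every move within winrate_margin (1% by default) of the best move
    acceptable = [m["move"] for m in move_infos if best - m["winrate"] <= winrate_margin]
    board = f"Black: {', '.join(black_stones)}; White: {', '.join(white_stones)}; turn: {turn}"
    return {"board": board, "ideal": "/".join(acceptable)}

Joining the acceptable moves with "/" matches the split("/") check in the scoring loop below.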
We'll use the OpenAI API with gpt-4 for this eval (assuming it has been trained on enough Go-related text to have some idea). We set temperature to 0 for consistency.
import openai, json

openai.api_key = "YOUR_API_KEY"

def format_go_prompt(board_desc, player_to_move):
    return [
        {"role": "system", "content": "You are an expert Go player and teacher."},
        {"role": "user", "content": f"Analyze the following Go board and suggest the best move for {player_to_move}.\nBoard state:\n{board_desc}\nIt is {player_to_move}'s turn. What is the best move for {player_to_move}?"}
    ]

# Load eval data
with open("go_eval.jsonl") as f:
    samples = [json.loads(line) for line in f]

correct = 0
for sample in samples:
    # Each board description contains whose turn it is, so parse it out (or store it separately)
    board_desc = sample["board"]
    # assume the board string contains "turn: X" at the end:
    player_to_move = "Black" if "turn: Black" in board_desc else "White"
    messages = format_go_prompt(board_desc, player_to_move)
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
    answer = resp["choices"][0]["message"]["content"].strip()
    # Simple parsing: take the first token, which might grab "D4" even if the model said "D4 is the best move."
    move = answer.split()[0]
    # If we allowed multiple ideal moves separated by "/", check against each of them
    if move.upper().strip(",.") in sample["ideal"].split("/"):
        correct += 1
    else:
        print(f"Board: {board_desc}\nModel answer: {answer} (expected {sample['ideal']})\n")
After running this, correct will be the number of times the model’s move matched KataGo’s top move. The print statement will output the cases the model got wrong, for analysis.
Suppose we run this and inspect the output. Looking at the printed failures, we might refine the eval: fix any positions that turn out to be ambiguous, make the answer parsing more forgiving of phrasing, or allow near-equal moves (per the win-rate margin above) to count as correct. We rerun after the tweaks and log the final results.
Finally, we prepare a summary of the findings (the overall accuracy and any notable failure patterns) for the README.
We then share the go_eval.jsonl file and the script in a GitHub repo, along with these findings. Perhaps also try a couple of other models (Claude, etc.) by adjusting the code, and include those results in our README for context.
By going through this example, we demonstrated the full process: designing an eval for a novel capability (game-playing), implementing it, running it, and analyzing outcomes. You can follow similar steps for other custom evals – whether it’s testing coding ability, logical puzzles, multi-modal reasoning, or anything else you can devise. Good luck with building your own LLM evals, and we look forward to seeing what creative benchmarks you come up with!